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(57) Abstract ^18 ^12 

A decision tree classifier (20) 
is designed using the hidden Markov 
model to yield a final classifier that 
can be implemented without using 

any mathematical operations. The ^ ^ , ^ . . _ ^ • i ♦ ^ 

apparatus of this invention moves all of the mathematical calculations to the construction of the decision tree. Once this is comply. 
Sc decision tree can be implemented using only logical operations such as memory addressing and binary companson operations. This 
simplicity allows the decision tree to be implemented in a simple hardware form using conventional gates. 
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DECIS10N TREE CLASSIFIER DESIGNED USING HIDDEN MARKOV 

MODELS 



Field Of The Invention 

This invention relates to pattern classification, and more particularly, to 
5 a method and apparatus for classifying unknown speech utterances into 
specific word categories. 



Background Of The Invention 

Speech recognition is the process of analyzing an acoustic speech 
signal to identify the linguistic message or utterance that was intended so that 

10 a machine can correctly respond to spoken commands. Fluent conversation 
with a machine is difficult because of the intrinsic variability and complexity of 
speech. Speech recognition difficulty is also a function of vocabulary size, 
confusability of words, signal bandwidth, noise, frequency distortions, the 
population of speakers that must be understood, and the form of speech to be 

15 processed. 

A speech recognition device requires the transformation of the 
continuous signal into discrete representations which may be assigned proper 
meanings, and which, when comprehended, may be used to effect responsive 
behavior. The same word spoken by the same person on two successive 

20 occasions may have different characteristics. Pattern classifications have 
been developed to help in determining which word has been uttered. 

Several methods of pattern classification have been utilized in the field 
of speech recognition. Currently, the hidden Markov model is the most popular 
statistical model for speech. The hidden Markov model characterizes speech 

25 signals as a stochastic state, or random distribution, machine with different 

characteristic distributions of speech signals associated with each state. Thus, 
a speech signal can be viewed as a series of acoustic sounds, where each 
sound has a particular pattern of harmonic features. However, there is not a 
one-to-one relationship between harmonic patterns and speech units. Rather, 

30 there is a random and statistical relationship between a particular speech 
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sound and a particular harmonic pattern. In addition, the duration of the 
sounds in speech does not unduly effect the recognition of whole words. The 
hidden Markov model captures both of these aspects of speech, where each 
state has a characteristic disthbution of harmonic patterns, and the transitions 

5 from state to state describe the durational aspects of each speech sound. 

The algorithms for designing hidden Markov models from a collection of 
sample utterances of words and sounds are widely known and disclosed in a 
book by Lawrence R. Rabiner and Biing Hwang Juang entitled "Fundamentals 
of Speech Recognition" Prentice-Hall, Inc. Englewood Cliffs, 1993, which is 

10 herein incorporated by reference. A method, commonly referred to as Baum- 
Welch re-estimation, allows one to continually refine models of spoken words. 
Once the statistical parameters of the model have been estimated, a simple 
formula exist for computing the probability of a given utterance being produced 
by the trained model. This latter algorithm is used in the design of isolated 

15 word speech recognition systems. By designing a collection of models, one for 
each word, one can then use the generated probability estimates as the basis 
for deciding the most likely word model to match to a given utterance. 

The decision tree classifier is another pattern classification technique. 
Decision tree classifiers imply a sequential method of determining which class 

20 to assign to a particular observation. Decision tree classifiers are most 

commonly used in the field of artificial intelligence for medical diagnosis and in 
botany for designing taxonomic guides. A decision tree can be described as a 
guide to asking a series of questions, where the latter questions asked depend 
upon the answers to earlier questions. For example, in developing a guide to 

25 identifying bird species, a relevant early question is the color of the bird. 

Decision tree design has proven to be a complex problem. The optimal 
decision tree classifier for a particular classification task can only be found by 
considering all possible relevant questions and all possible orders of asking 
them. This is a computationally impossible task even for situations where a 

30 small number of categories and features exist. Instead, methods of designing 
near optimal decision trees have been proposed using the measurements 
defined by the field of information theory. 
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Earlier attempts at designing decision tree classifiers for speech 
recognition using conventional methods proved unsuccessful. This is because 
an unrealistically vast quantity of training utterances would be required for 
accurate characterization of speech signals. This is due to the fact that the 

5 decision tree conditions questions based upon the responses to earlier 
questions. In any finite training set, *and with each question, the number of 
examples of a particular utterance meeting all of the necessary criteria 
imposed by earlier questions dwindles rapidly. Thus, as observations 
accumulate with deeper nodes of the decision tree, the estimates used in 

10 designing the decision tree become increasingly inaccurate. 

For the application of isolated word recognition, a specific class of 
hidden Markov models is employed, commonly referred to as discrete output 
hidden Markov models. For discrete output hidden Markov models, the spoken 
words are represented by a sequence of symbols from an appropriately 

15 defined alphabet of symbols. The method used to transform a set of acoustic 
patterns into a set of discrete symbols is known in the art as vector 
quantization. 

Prior art implementations of speech recognition systems using hidden 
Markov models considered the contribution of ail parts of the utterance 

20 simultaneously. The prior art required that the feature extraction and vector 
quantization steps be performed on the entire utterance before classification 
can begin. In addition, hidden Markov model computations require complex 
mathematical operations in order to estimate the probabilities of particular 
utterance patterns. These operations include the use of the complex 

25 representation of floating point numbers as well as large quantities of 
multiplications, divisions, and summations. 



Summarv Of The Invention 
This invention overcomes the disadvantage of the prior art by providing 
a speech recognition system that does not use complex mathematical 
30 operations and does not take a great deal of time to classify utterances into 
word categories. 
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A decision tree classifier, designed using the hidden Markov model, 
yields a final classifier that can be implemented without using any 
mathematical operations. This invention moves all of the mathematical 
calculations to the construction, or design, of the decision tree. Once this is 
completed, the decision tree can be implemented using only logical operations 
such as memory addressing and binary comparison operations. This simplicity 
allows the decision tree to be implemented directly in a simple hardware form 
using conventional logic gates or in software as greatly reduced set of 
computer instructions. This differs greatly from previous algorithms which 
required that a means of mathematical calculation of probabilities be provided 
in the classification system. 

This invention overcomes the problems of earlier decision tree classifier 
methods by introducing the hidden Markov modeling technique as an 
intermediate representation of the training data. The first step is obtaining a 
collection of examples of speech utterances. From these examples, hidden 
Markov models corresponding to each word are trained using the Baum-Welch 
method. 

Once the models of the speech utterances have been trained, the 
training speech is discarded. Only the statistical models that correspond to 
each word are used in designing the decision tree. The statistical models are 
considered a smoothed version of the training data. This smoothed 
representation helps the decision tree design prevent over-specialization. The 
hidden Markov model imposes a rigid parametric structure to the probability 
space, preventing variability in the training data from ruining the tree design 
process. Estimates for unobserved utterances are inferred based on their 
similarity to other examples in the training data set. 

In order to employ the information theoretic method of designing 
decision trees, it is necessary to derive new methods for estimating the 
probability values for specific sequences of acoustically labeled sounds. 
These new methods differ from earlier hidden Markov model algorithms in that 
they must yield probability values for sequences of sounds that are only 
partially specified. From a probabilistic perspective, the solution is to sum up 
the probabilities for all possible sequences of sounds for all sounds not 
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specified. However, this method is computationally prohibitive. Instead, a 
modification to the hidden Markov model algorithm known as the forward 
algorithm can be employed to compute the probabilities within a reasonable 
amount of time. 

5 Using the techniques discussed above, it is possible to design the 

decision tree before the finical classification task. This is desirable, as this 
eliminates (virtually) ail computations from the classification task. 

Brief Description Of The Drawings 

FIG. 1 is a block drawing of a speech recognition system that was 
10 utilized by the prior art; 

FIG. 2 is a block diagram of the apparatus of this invention; 
FIG. 3 is the decision tree; 

FIG. 4 is a table of the time index and corresponding vector quantizer 
symbols for sample utterances; and 
15 FIG- 5 is a block diagram of a decision tree design algorithm. 

Description Of The Preferred Embodiment 

Referring now to the drawings in detail and more particularly to FIG. 1: 
the reference character 1 1 represents a prior art speech recognition system for 
the words "yes", "no" and "help". The words "yes", "no" and "help" have been 

20 selected for illustrative purposes. Any words or any number of words may 
have been selected. Speech recognition system 1 1 compnses: a feature 
extractor 12; a vector quantizer 13; a hidden Markov Model (HMM) for the word 
"yes" 14; a hidden Markov Model (HMM) for the word "no" 15; a hidden Markov 
Model (HMM) for the word "help" 16 and a choose maximum 17 

25 The words "yes", "no" and "help" may be spoken into speech input 

device 18. Typically speech input device will consist of a microphone, 
amplifier and a A/D converter (not shown). The output of device 18 will be 
digitally sampled speech at a rate of approximately 12 Khz and a 16 bit 
resolution. The output of device 18 is coupled to the input of feature extractor 

30 12. 
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Feature extractor 12 consists of a frame buffer and typical speech 
recognition preprocessing. Extractor 12 is disclosed in a book by Lawrence R. 
Rabiner and Biing Hwang Juang entitled "Fundamentals of Speech 
Recognition/' Prentice-Hall, inc., Englewood Cliffs. 1993. which is herein 
5 incorporated by reference. An implementation entails the use of a frame buffer 
of 45 ms with 30 ms of overlap between frames. Each frame is processed 
using a pre-emphasis filter, a Hamming windowing operation, an 
autocorrelation measurement with order 10, calculation of the Linear Prediction 
Coefficients (LPC's) followed by calculation of the LPC cepstrat parameters. 
10 Cepstral parameters provide a complete source-filter characterization of the 
speech. The cepstral parameters are a representation of the spectral contents 
of the spoken utterance and contain information that includes the location and 
bandwidths of the formants of the vocal tract of the speaker. An energy term 
provides additional information about the signal amplitude. The output of 
15 extractor 12 are the aforementioned LPC cepstral parameters of order 10 and 
a single frame energy term that are coupled to the input of vector quantizer 13. 

Vector quantizer 13 utilizes the algohthm disclosed by Rabiner and 
Juang in "Fundamentals of Speech Recognition" to map each collection of 1 1 
features received from extractor 12 into a single integer. The above vector 
20 quantizer 13 is developed using a large quantity of training speech data. The 
integer sequence that constitutes a discrete version of the speech spectrum is 
then coupled to the inputs of HMM for the word "yes" 14, HMM for the word 
"no" 15 and HMM for the word "help" 16. HMM 14, HMM 15 and HMM 16 each 
individually contains a set of hidden Markov model parameters. When a 
25 sequence of integers representing a single spoken word is presented, the 
mathematical operations necessary to compute the observation sequence 
probability are performed. In the notation of speech recognition, this algorithm 
is known as the fonA^ard algorithm. HMM 14, HMM 15 and HMM 16 performs 
their own computations. These computations are time consuming. 
30 The outputs of HMM 14. HMM 15 and HMM 16 are coupled to the input 

of choose maximum 17. Choose maximum 17 is a conventional recognizer 
that utilizes a computer program to determine whether HMM 14, HMM 15, or 
HMM 16 has the highest probability estimate. If HMM 14 had the highest 
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probability esiimate. then the system concludes the word "yes' was spoken 
into device 18 and if HMM 15 had the highest probability estimate the word 
"no" was spoken into device 18 The output of choose maximum 17 is utilized 
to effect some command input to other systems to place functions in these 
5 other systems under voice control. 

FIG. 2 is a block diagram of the apparatus of this invention. The output 
of speech input device 18 is coupled to the input of feature extractor 12 and 
the output of feature extractor 12 is coupled to the input of vector quantizer 13. 
The output of vector quantizer 1 3 is coupled to the input of decision tree 

10 classifier 20. 

Decision tree classifier 20 has a buffer that contains the vector 
quantizer values of the entire spoken utterance, i.e. "yes", "no'\ "help". The 
decision tree classifier 20 may be represented by the data contained in the 
table shown in FIG. 3 and the associated procedure for accessing the 

15 aforementioned table. The table shown in FIG. 3 is generated using the 
statistical information collected in the hidden Markov models for each of the 
words in the vocabulary, i.e. " yes", "no" and "help\ FIG 4 contains several 
examples of spoken utterances for illustrative purposes, and will be used in the 
discussions that follow. 

20 The method used to classify utterances with decision tree classifier 20 

involves a series of inspections of the symbols at specific time instances, as 
guided by the information in FIG 3. We begin with step SO for each 
classification task. This table instructs us to inspect time index 3 and to 
proceed to other steps based on the symbols found at time index 3, or to 

25 announce a final classification when no further inspections are required. We 
will demonstrate this procedure for three specific utterances chosen from FIG 
4. 

We always begin the decision tree shown in FIG. 3 at step SO. This 
line of the decision tree tells us to look at the vector quantizer symbol 
30 contained at time index 3 (FIG. 4) for any spoken word. Let us assume that 
the word "no" was spoken. The symbol at time index 3 for the first word "no" is 
7. The column for symbol 7 (FIG. 3) at time 3 tells us to proceed to step S3. 
This step tells us to look at time index 6 (FIG. 4). The word "no" has a symbol 
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10 



15 



7 at time index 6. The symbol 7 column of step S3 (FIG. 3) tells us to proceed 
to step S8. Step S8 tells us to refer to time index 9 (FIG. 4). At time index 9 
for the word "no", we find a symbol value of 8, which informs us to go to column 

8 of step 8 (FIG. 3). This tells us to classify the answer as a ' no", which is 
correct. 

If the word "yes" was chosen at step SO of FIG. 3. This line of the table 
tells us to look at time index 3 (FIG. 4) for our chosen word. The symbol at 
time index 3 for the first sample utterance for the word "yes" is 7. The column 
for symbol 7 (FIG. 3) at time 3 tells us to proceed to step S3. This step tells us 
to look at time index 6 (FIG. 4). The word "yes" has a symbol 3 at time index 6. 
The symbol 3 column of step S3 (FIG. 3) tells us to classify this input, 
correctly, as the word "yes." 

If the word "help" was spoken at step SO of FIG. 3, this line of the table 
tells us to look at time index 3 (FIG. 4) for our chosen word. The symbol at 
time index 3 for the word '^help" is 6. The column for symbol 6 (FIG. 3) at time 
3 tells us to proceed to step S2. This step tells us to look at time index 4 (FIG. 
4). The word "help" has a symbol 6 at time index 4 The symbol 6 column of 
step 82 (Fig. 3) tells us that the word "help" should be selected. 

The "EOW" column of FIG. 3 tells us where to go when the end of a 
word has been reached at the time index requested for observation. The label 
"n/a" was included for utterances too short to qualify as any of the words in the 
training set. 

The decision table shown in FIG. 3 can be used to correctly classify all 
of the examples presented in FIG. 4. However, the above decision tree and 

25 examples would not perform well on a large data set. as it is far too small to 
capture the diversity of potential utterances. Real implementations of the 
algorithm, which are hereinafter described, would generate tables with 
thousands of lines and at least several dozen columns. The above example is 
meant to illustrate the output of the hidden Markov model conversion to 

30 decision tree algorithm, and the procedure implemented in decision tree 
classifier 20 that uses the resultant data table. 

FIG. 5 is a flow chart of a decision tree design algorithm. General 
decision tree design principles are disclosed by Rodney M. Goodman, and 



20 
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Padhraic Smyth in their article entitled "Decision tree design from a 
communications theory standpoint", IEEE Transactions on Information Theory, 
Vol., 34, No. 5, pp. 979-997 Sept. 1988 which is herein incorporated by 
reference. The decision tree is designed using an iterative process. The 

5 precise rule applied for the expansion of each branch of the decision tree is 
specified. The desired termination condition depends highly on specific needs 
of the application, and will be discussed in broader terms. 

In the preferred implementation, a greedy algorithm is specified where 
the termination condition is the total number of nodes in the resulting tree. 

10 This is directly related to the amount of memory available for the storage of the 
classifier data structures. Thus, this termination condition is well suited to 
most practical implementations. 

In order to design the tree, the measure the mutual information between 
an arbitrary set of observations and the word category is needed. To find the 

15 mutual information, it is necessary to compute the probability of observing a 
partially specified sequence of vector quantizer labels This is accomplished 
using a modified version of the hidden Markov model equations, contained in 
block 21 for the word "yes\ block 22 for the word "no" and block 23 for the 
word "help", normally referred to as the forward algorithm. This modification 

20 involves removing the terms in the forward algorithm associated with 

unspecified vector quantizer outputs and replacing their probability values with 
one (1) in the conventional hidden Markov model algorithm. 

This algorithm is contained in box 24 and is repeated iteratively, 
converting the terminal, or leaf, nodes of the decision tree into internal nodes 

25 one at a time, and adding new leaf nodes in proportion to the number of entries 
in the vector quantizer. This continues until the prespecified maximum number 
of tree nodes has been reached. The resulting data structures can then be 
organized into the, decision tree as has been shown in FIG. 3, and stored in 
the decision tree classifier 20. 

30 The following decision tree design algorithm may be utilized in block 24: 

Before the decision tree algorithm can begin, a collection of discrete output 
hidden Markov models must be trained for each word in the recognition 
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vocaoulary using a suitable training data set. The hidden Markov model 
parameters for a vocabulary of size WV^will be denoted as 
A . state transition matrix for word / 

/ = /i 2 W 

state output distribution matrix for word / \*^y^y*'*9^^ 
/T, initial state distribution for word / 

We further assume that the number of different outputs from the vector 
quantizer is denoted by Q, also known as the vector quantizer codebook size. 
In addition, we will define the notation for a set of vector Quantizer outnuts at 
different time indices as 

^={M{'.,^2) {'„.S,.)} 

where the set X consists of pairs of observed symbols and their associated 
time index. 

The tree design algorithm proceeds as follows: Two sets of sets are 
defined. The first set, 7, is the collection of observation sets associated with 

15 internal tree nodes. The second set, L, is the collection of observation sets 
associated with leaf nodes of the tree. Initially, the set 7 is set to the empty 
set, and L is a set that contains one element, and that element is the empty 
set. The main loop of the algorithm moves elements from the set L to the set 7 
one at a time until the cardinality, or number of elements, in set 7 reaches the 

20 pre-specified maximum number. 

The determination of which set contained in L should be moved to set 7 
at any iteration is determined using an information theoretic criterion. Here, 
the goal is to reduce the total entropy of the tree specified by the collection of 
leaf nodes in L. The optimal greedy selection for the collection of observation 

25 sets is given by 



where 



max H(h;|x)p(X) 



wi.h HM^) = -2p('"|A-)1o£3P(v.|^) 
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p(^ 2pW'")pO") 

The probability of observing a collection of vector quantizer outputs 
conditioned on the word w is computed using the hidden Markov model 
equations. One expression of this calculation is given by 

P{X\w = 0= Z |nB,W'y)..S;]T^,Wl)]ilA,[A:(0,;c:(/ + l)] 

v^hich can be efficiently calculated using a slightly modified version of the 
hidden Markov model equation commonly referred to as the forward algorithm. 

With each iteration, the optimal set X contained in L is moved to the set 
7. In addition, the optimal set X is used to compute the optimal time index for 
10 further node expansion. This is done using an information theoretic criteria 
specified by q 

arg min ^ H(vii X <j {(/, s)}) ?{X ^ {(/, s)}) 

The time index specified by this relationship is used to expand the 
collection of leaf nodes to include ail possible leaf nodes associated with the 

15 set X. Each of these new sets, which are designated by 

L<-L^{X^ ((/. s)}} Vs G {1,2, . . . , Q} 

are added to the set L and the set X is removed from set L. Once this 
operation is completed, the algorithm repeats by choosing the best element of 
L to transfer to 7, etc. This algorithm is expressed in pseudocode in FIG 6. 

20 The above specification describes a new and improved apparatus and 

method for classifying utterances into word categories. It Is realized that the 
above description may indicate to those skilled in the art additional ways in 
which the principles of this invention may be used without departing from the 
spirit. It is, therefore, intended that this invention be limited only by the scope 

25 of the appended claims. 
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WHAT IS CLAIMED IS: 

1 . A method of classifying utterance which comprises the steps of: 
obtaining a collection of examples of speech utterances; 
training hidden Markov models to obtain statistical models that 

correspond to individual words to classify the speech utterances; 

using the statistical models to design a decision tree that represents 
a classifier of the speech utterances; 

using the hidden Markov model to compute the probabilities of 
sounds not uttered; and 

determining speech utterances with the decision tree. 

2. The method claimed in 1. wherein the statistical models that are 
used to design the decision tree are implemented in a greedy iterative fashion. 

3. The method claimed in Claim 2, wherein the recursive greedy 
fashion of implementing the decision tree design includes the steps of: 

iteratively expanding entries of the decision tree table using the 
mutual information between the observed symbols and the word class; and 

terminating the decision table when the maximum number of entries 
is reached. 

4. The method claimed in Claim 3. wherein the mutual information Is 
obtained in accordance with the equation 

Ite A^) = H(w)-H(]ii| X) 

5. The method claimed in Claim 3 further including the step of: 
using the maximum expected information gain to determine the 

optimal time index to expand each entry in the decision tree table. 

6. A system for classifying utterances which comprises: 
means for imputing speech; 

means coupled to said inputting means for transforming the speech 
into an alphabet of acoustic patterns; 

SUBSTITUTE SHEET (RULE 26) 
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means coupled to said transforming means for classifying the 
acoustic patterns into word categories by utilizing a decision tree classifier 
designed using hidden Markov models. 

7 The system claimed in Claim 6, wherein said classifying means 
comprises: 

a memory containing classification instructions; 

means coupled to said memory for sequentially accessing said 
memory based upon the acoustic patterns; and 

means coupled to said accessing means for outputting the terminal 
classification decision. 

8. The system claimed in Claim 6. wherein said transforming means 
comprises: 

a feature extractor for converting the inputted speech signals into 
frame based acoustical patterns; and 

a vector quantizer coupled to said feature extractor for mapping the 
acoustical patterns to a fixed alphabet of symbols. 
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